On developing new text and audio corpora and speech recognition tools for the Turkish language
Authors
Abstract
This paper describes recent work towards the development of new corpora and tools for Turkish speech research. The effort is an ongoing collaboration between the Center for Spoken Language Research (CSLR) at the University of Colorado and the Department of Electrical Engineering at the Middle East Technical University (METU). A new text corpus built from Turkish newspaper text is described. In addition, a 193-speaker audio corpus and a pronunciation lexicon for the Turkish language have been developed. We then describe our initial work towards porting Sonic, the CSLR speech recognition system, to the Turkish language. Results are shown for phonetic alignment and phoneme recognition accuracy using the newly constructed corpus and speech tools. On the Turkish audio corpus, 91.2% of the automatically labeled phoneme boundaries are placed within 20 ms of the hand-labeled locations. Finally, a phoneme recognition error rate of 29.3% is demonstrated.
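The two figures quoted in the abstract can be illustrated with a minimal Python sketch. It is not the scoring procedure used by the authors, which the abstract does not describe. It assumes that automatic and hand-labeled boundaries are paired one-to-one (as they would be when both label the same phone sequence) and that the phoneme error rate is the edit distance between reference and hypothesis divided by the reference length. The function names `boundary_agreement` and `phoneme_error_rate` are hypothetical.

```python
from typing import Sequence


def boundary_agreement(auto: Sequence[float], hand: Sequence[float],
                       tol: float = 0.020) -> float:
    """Fraction of automatic boundaries within `tol` seconds of the hand label.

    Assumption: auto[i] and hand[i] mark the same phone boundary.
    """
    if len(auto) != len(hand) or not auto:
        raise ValueError("need two non-empty boundary lists of equal length")
    hits = sum(1 for a, h in zip(auto, hand) if abs(a - h) <= tol)
    return hits / len(auto)


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """(substitutions + deletions + insertions) / len(ref), via edit distance."""
    if not ref:
        raise ValueError("reference must not be empty")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# Toy data, invented for illustration only (times in seconds).
auto_bounds = [0.10, 0.25, 0.40, 0.58]
hand_bounds = [0.11, 0.25, 0.45, 0.59]
agreement = boundary_agreement(auto_bounds, hand_bounds)

ref_phones = ["m", "e", "r", "h", "a", "b", "a"]
hyp_phones = ["m", "e", "r", "a", "b", "a"]
per = phoneme_error_rate(ref_phones, hyp_phones)
```

In the toy data, three of the four boundaries fall within 20 ms, and the hypothesis drops one of seven reference phonemes.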
Similar resources
Concordancing for parallel spoken language corpora
Concordancing is one of the oldest corpus analysis tools, especially for written corpora. In NLP, concordancing is used in the training of speech recognition systems. Additionally, comparative studies of different languages result in parallel corpora. Concordancing for these corpora in an NLP context is a new approach. We propose to combine these fields of interest for a multi-purpose concordance for ...
Mylly - The Mill: A New Platform for Processing Speech and Text Corpora Easily and Efficiently
Speech and language researchers need to manage and analyze increasing quantities of material. Various tools are available for various stages of the work, but they often require the researcher to use different interfaces and to convert the output from each tool into suitable input for the next one. The Language Bank of Finland (Kielipankki) is developing an on-line platform called Mylly for proc...
Speech Recognition and Information Retrieval: Experiments in Retrieving Spoken Documents
The Informedia Digital Video Library Project at Carnegie Mellon University is making large corpora of video and audio data available for full content retrieval by integrating natural language understanding, image processing, speech recognition and information retrieval. Information retrieval from corpora of speech recognition output is critical to the project’s success. In this paper, we out...
Endangered Language Documentation: Bootstrapping a Chatino Speech Corpus, Forced Aligner, ASR
This project approaches the problem of language documentation and revitalization from a rather untraditional angle. To improve and facilitate language documentation of endangered languages, we attempt to use corpus linguistic methods and speech and language technologies to reduce the time needed for transcription and annotation of audio and video language recordings. The paper demonstrates this...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...